Business Analytics

Advanced Data Visualizations

Ayush Patel and Jayati Sharma

16 March, 2024

Pre-requisite

You already…

  • Know basic and advanced data wrangling functions in R
  • Know basics of data visualization in R
  • Can write functions in R

Before we begin

Please install and load the following packages

library(dplyr)
library(tidyverse)
library(scales)
library(patchwork)
library(gghighlight)
library(ggiraph)
library(ISLR2)
library(openintro)
library(janitor)
library(gapminder)
library(palmerpenguins)



Access the lecture slides from the course landing page

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse instructor.

I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn annotation for graphs in R
  • Learn how to combine graphs
  • Learn scaling functions in R
  • Learn how to make ggplot graphs interactive

Let’s Recap

  • In the data visualization lecture, you learnt how to create various types of graphs using ggplot2
  • These include bar graphs, line graphs, scatter plots, etc.
  • For effective data visualization and communication, any plot requires modifications
  • These include annotations on the plot, modification of axes and scales, highlighting and interactivity of the plot
  • The aim of this lecture is to move beyond making graphs, towards clear and effective visualizations

Annotations in ggplot - Text

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • In addition to plotting your graph, you want to provide additional details to explain your graph
  • Text annotations are useful in this case
  • The annotate() function can be used for any kind of geometric object
  • In the annotate() function, the type of geom is specified first
  • Then, the positioning is required (x and y coordinates in this case)
  • This is followed by the label
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("text", x = 4, y = 25, label = "Annotation Text")

Annotations in ggplot

  • Further, annotations can be customized
 ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("text", x = 4, y = 25, label = "Annotation Text", colour = "orange", size = 8)

 ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("text", x = 1:5, y = 6, label = "Annotation Text", colour = "orange", size = 3)
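
Because annotate() accepts vectors, each label can also get its own position and text. A minimal sketch, assuming illustrative coordinates and the made-up labels "Light cars" and "Heavy cars" (neither is part of the original example):

```r
library(ggplot2)

# One annotate() call, two labels: the i-th x, y and label belong together
p_labels <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("text",
           x = c(2.5, 4.8),                      # illustrative positions
           y = c(32, 20),
           label = c("Light cars", "Heavy cars"), # illustrative labels
           colour = "orange", size = 5)

p_labels
```

This avoids writing one annotate() call per label when several notes share the same styling.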

Annotations in ggplot

  • Similar to text annotation, other geoms can be used for annotations
  • However, instead of x and y, a rectangle is positioned using xmin, xmax, ymin and ymax
  • Do you remember what alpha is used for?
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("rect", xmin = 4.8, 
            xmax = 5.7,
            ymin = 10,
            ymax = 18.6, 
            alpha = .2)

Annotations in ggplot

  • Suppose you want to add a line segment to your graph
  • Here, annotate() requires the start (x, y) and end (xend, yend) coordinates of the segment
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("segment", x = 4.8,
            xend = 5.7,
            y = 10,
            yend = 18.6,
            colour = "red")
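
A segment becomes more useful when it points at something. The sketch below turns the segment into an arrow aimed at the heaviest car and adds a text label next to its tail. The tail position and the label wording are illustrative assumptions; arrow() and unit() are the grid helpers that ggplot2 re-exports.

```r
library(ggplot2)

# The row of mtcars with the largest weight
heaviest <- mtcars[which.max(mtcars$wt), ]

p_arrow <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("segment",
           x = 4.5, y = 25,                          # arrow tail (illustrative)
           xend = heaviest$wt, yend = heaviest$mpg,  # arrow head at the data point
           colour = "red",
           arrow = arrow(length = unit(0.3, "cm"))) +
  annotate("text", x = 4.5, y = 26, label = "Heaviest car")

p_arrow
```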

Do it Yourself - 1

  • Load the Auto data from ISLR2 package
  • Make a scatterplot of horsepower and acceleration
  • To the plot, add the text Horsepower vs. Acceleration
  • Add a rectangle to the plot, such that it covers the area where horsepower is higher than 200 but acceleration is still less than 15
  • Add a line to the plot from the coordinates (50,10) to (150,20)

Scales Functions in ggplot2 - Why?

  • When you create a graph using ggplot2, the axes are set automatically based on the data
  • However, you would often need to change the axes in order to effectively present the data
  • The scale functions in ggplot2:
    • control how the data is plotted
    • allow manipulation of axes
    • improve the overall appearance of the plot for effective data communication

Scales Functions in ggplot2

  • Look at the scatter plot of wt and mpg
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()

  • What if you want the axes to start from 0?
  • scale_y_continuous() allows you to set the range for the y-axis (scale_x_continuous() does the same for the x-axis)
  • The limits argument inside scale_y_continuous() sets the lower and upper limits of the scale
  • Here, NA means that the upper limit is left to be computed from the data
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(limits = c(0, NA))
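
The code above fixes only the y-axis. A minimal sketch that starts both axes from 0, using the same limits idea on scale_x_continuous():

```r
library(ggplot2)

p_zero <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_continuous(limits = c(0, NA)) +  # x-axis: start at 0, keep the data maximum
  scale_y_continuous(limits = c(0, NA))    # y-axis: start at 0, keep the data maximum

p_zero
```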

Scales Functions in ggplot2

  • Instead of NA, you can give an explicit upper limit for the y-axis, such as 40
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(limits = c(0, 40))
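
One caveat when the limits are narrower than the data: scale limits treat out-of-range values as missing, so those points are dropped with a warning, whereas coord_cartesian() only zooms the view. A small sketch of the difference, where the 15 to 25 range is an arbitrary illustrative choice:

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Scale limits: points with mpg outside 15-25 are removed before drawing
p_limits <- base_plot + scale_y_continuous(limits = c(15, 25))

# Coordinate limits: the view zooms to 15-25, but no data is removed
p_zoom <- base_plot + coord_cartesian(ylim = c(15, 25))
```

The distinction matters most for layers that compute summaries, such as smoothers or boxplots, because removed points no longer contribute to the calculation.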

Scales Functions in ggplot2

  • Setting breaks in scale_y_continuous() lets you choose the values at which the axis shows ticks and labels
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(breaks = seq(0, 40, 7))

Scales Functions in ggplot2

  • Similarly, there are other transformations for scales, such as reversing the scale
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_reverse()

  • scale_y_log10() applies a base-10 log transformation to the scale
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_log10()

Scales Functions in ggplot2

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • scale_colour_brewer() is useful for colouring discrete values on your graph
  • The brewer scales provide sequential, diverging and qualitative colour schemes from ColorBrewer
  • Look at the two charts
  • scale_colour_brewer() helps in efficient mapping of discrete variables to colours
ggplot(mpg, aes(x = displ, y = cty)) +
   geom_point(aes(colour = class))

ggplot(mpg, aes(x = displ, y = cty)) +
   geom_point(aes(colour = class))+
  scale_colour_brewer()
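
Called without arguments, scale_colour_brewer() uses a sequential palette, which suggests an ordering that vehicle classes do not have. A sketch with a qualitative palette instead; "Set2" is one reasonable choice among several ColorBrewer palettes, not the only one:

```r
library(ggplot2)

p_brewer <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  scale_colour_brewer(palette = "Set2")  # qualitative palette: one distinct hue per class

p_brewer
```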

Do it Yourself - 2

  • Using the Auto data, make a scatterplot with weight on the y-axis and displacement on the x-axis
  • Set the breaks for the x-axis as 50
  • Set the breaks for the y-axis as 5
  • What variable according to you can be used as the colour for the points? How?

Scales Package

  • The scales package provides many scaling functions for visualizations
  • It allows for sophisticated customisation of data in a plot
  • Functions for readable and informative axes
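
The label functions in scales are function factories: each call returns a function that formats a numeric vector, which is why they can be passed directly to the labels argument of a ggplot2 scale. A minimal sketch outside of any plot (the input numbers are arbitrary):

```r
library(scales)

# Each label_*() call returns a formatting function
comma_lab   <- label_comma()
dollar_lab  <- label_dollar()
percent_lab <- label_percent(accuracy = 1)
short_lab   <- label_number(scale_cut = cut_short_scale())

comma_lab(1234567)   # "1,234,567"
dollar_lab(1500)     # "$1,500"
percent_lab(0.25)    # "25%"
short_lab(c(1000, 250000, 3000000))  # thousands and millions get short suffixes
```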

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • Look at the following chart made using txhousing data
  • We want to make it more readable and clear
  • The number of zeroes on the y-axis can be reduced along with the way years are represented on the x-axis
txhousing %>% 
  mutate(date = make_date(year, month, 1)) %>% 
  group_by(city) %>% 
  filter(min(sales) > 500) %>% 
  ggplot(aes(date, sales, group = city)) + 
  geom_line(na.rm = TRUE)

Scales

  • Similar to the scale functions in ggplot2, the scales package has functions for breaks and labels
  • breaks_width("2 years") places a break every two years on the x-axis, while label_date("'%y") shows only the last two digits of the year, making the axis clearer
  • On the y-axis, cut_short_scale() replaces the trailing zeros with a short suffix, such as K for thousands
txhousing %>% 
  mutate(date = make_date(year, month, 1)) %>% 
  group_by(city) %>% 
  filter(min(sales) > 500) %>% 
  ggplot(aes(date, sales, group = city)) + 
  geom_line(na.rm = TRUE) + 
  scale_x_date(
    NULL,
    breaks = scales::breaks_width("2 years"), 
    labels = scales::label_date("'%y")) + 
  scale_y_log10(
    "Total sales",
    labels = scales::label_number(scale_cut = scales::cut_short_scale()))

Scales

  • Let us try modifying another graph using economics data
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line()

  • How can this be made more readable?
  • For the x-axis, one option is to show a few months along with the years, for better insights
  • For the y-axis, a label that adds the dollar sign would make the chart more readable

Scales

  • breaks_width("3 months") places a break every 3 months
  • However, you might want the axis labels to show months rather than full dates
  • label_date_short() shortens the date labels, showing only the components that change from one label to the next
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short())

Scales

  • For the y-axis, breaks_extended() lets you request roughly how many breaks you want (8 here)
  • label_dollar() adds a dollar sign to the y-axis labels
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short()) + 
  scale_y_continuous("Personal consumption expenditures",
    breaks = scales::breaks_extended(8),
    labels = scales::label_dollar())

Do it Yourself - 3

  • Load the tourism data from openintro
  • Make a line graph of year and tourist spending
  • Is there any change you could make to the chart for better readability?

Patchwork

  • You have made multiple plots by now and want to combine them into the same graphic
  • A very easy way to do this is by using patchwork
  • Let us learn this using our recently made plots
p1 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line()

p2 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short())

p3 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short()) + 
  scale_y_continuous("Personal consumption expenditures",
    breaks = scales::breaks_extended(8),
    labels = scales::label_dollar())

Patchwork

  • The usage of patchwork is very simple: you literally just add plots together!
p1 + p2

  • You can also put the plots one below the other
p1 / p2

Patchwork

  • While plots p1 and p2 show the intermediate steps, p3 is the final plot
  • It would be better to have the two at the top and the final one at the bottom
(p1 + p2) / p3

Patchwork

  • After combining the plots, you may want to modify all of them at once; the & operator applies a change to every plot in the patchwork
patchwork <- (p1 + p2) / p3
patchwork & theme_minimal()
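
The difference between + and & is easy to miss: + changes only the last plot in the patchwork, while & changes all of them. A self-contained sketch using three small mtcars plots; the plot names q1 to q3 and the title text are illustrative, and plot_annotation() is used to add an overall title and panel tags:

```r
library(ggplot2)
library(patchwork)

q1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
q2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
q3 <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Two plots on top, one below, with a shared title and tags A, B, C
combined <- (q1 + q2) / q3 +
  plot_annotation(title = "Fuel efficiency in mtcars", tag_levels = "A")

one_themed <- combined + theme_minimal()  # theme reaches only the last plot (q3)
all_themed <- combined & theme_minimal()  # theme reaches every plot

all_themed
```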

Do it Yourself - 4

  • From the tourism data, make line charts of year and visitor_count_tho, one for each decade
  • Combine these charts in such a way that at the top, the graph for all years is displayed and below it, there are 5 charts, one for each decade

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • Run the following code to generate a random dataset
set.seed(2)
data <- purrr::map_dfr(letters, ~ data.frame(
      id = 1:500,
      value = cumsum(runif(500, -5, 5)),
      type = .,
      flag = sample(c(TRUE, FALSE), size = 500, replace = TRUE),
      stringsAsFactors = FALSE))
  • Suppose you want to plot value against id, with one line for each type
ggplot(data) +
  geom_line(aes(x= id, y = value, colour = type))

Highlight information - gghighlight()

  • Too much clutter, right?
  • You can filter the data to keep only the lines with the largest values
data_filtered <- data %>%
  group_by(type) %>% 
  filter(max(value) > 112) %>%
  ungroup()

ggplot(data_filtered) +
  geom_line(aes(id, value, colour = type))

Highlight information - gghighlight()

  • However, that takes away the context of our visualization
  • It is also tedious
  • gghighlight() comes in handy here
  • Think of it as the filter() equivalent for ggplot2
ggplot(data) +
  geom_line(aes(id, value, colour = type)) +
  gghighlight(max(value) > 112)

  • You can customise plots as well
ggplot(data) +
  geom_line(aes(id, value, colour = type)) +
  gghighlight(max(value) > 112)+
  theme_minimal()
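
gghighlight() itself has arguments for customisation. The sketch below is self-contained, so it builds a smaller random-walk dataset (called walks here, an illustrative name) rather than reusing data. It switches off the direct labels so that a legend is shown instead, and sets the colour of the unhighlighted lines:

```r
library(ggplot2)
library(gghighlight)

# Five random walks of 100 steps each
set.seed(1)
walks <- data.frame(
  id    = rep(1:100, times = 5),
  type  = rep(letters[1:5], each = 100),
  value = unlist(lapply(1:5, function(i) cumsum(runif(100, -5, 5))))
)

p_walks <- ggplot(walks, aes(x = id, y = value, colour = type)) +
  geom_line() +
  gghighlight(max(value), max_highlight = 2L,          # keep the two lines with the largest maximum
              use_direct_label = FALSE,                # use a legend instead of labels on the lines
              unhighlighted_params = list(colour = "grey85")) +  # colour of the background lines
  theme_minimal()

p_walks
```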

Highlight information - gghighlight()

  • Similarly, you can highlight other types of geoms as well
ggplot(mpg)+
  geom_point(aes(displ, cty))+
  gghighlight(displ >= 5)

  • You can use more than one condition for gghighlight
ggplot(mpg)+
  geom_point(aes(displ, cty))+
  gghighlight(displ >= 5, cty >= 15)

Highlight information - gghighlight()

  • You can set the maximum number of lines to highlight using max_highlight
ggplot(data) +
  geom_line(aes(id, value, colour = type)) +
  gghighlight(max(value), max_highlight =  5L)

Do it Yourself - 5

  • Use txhousing data from ggplot2 and make a new dataset called txhousing_sales which calculates the total sales of each city for the year
  • Now, make a line chart of total sales over the years for all cities. Highlight the top 4 cities
  • From txhousing_sales, filter for the year 2015 and make a bar chart which shows the total_sales for all cities. Highlight the top 5

Creating interactive Visualizations - ggiraph()

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • Until now, we have learnt many ways of making effective visualizations
  • Adding interactivity to our plots makes it easier for people to focus on what they find important
  • ggiraph helps turn our graphs into interactive visualizations
  • It can be combined with many other features to connect different visualizations and communicate data-driven insights
  • Run the following code for data preparation
gapminder_data <- gapminder::gapminder %>%
  janitor::clean_names() %>%
  mutate(id = levels(continent)[as.numeric(continent)],
    continent = forcats::fct_reorder(continent, life_exp))

mean_life_exps <- gapminder_data %>% 
  group_by(continent, year, id) %>%
  summarise(mean_life_exp = mean(life_exp)) %>%
  ungroup()

Creating interactive Visualizations - ggiraph()

  • Let us make a simple line chart first
line_chart <- mean_life_exps %>%
  ggplot(aes(x = year, y = mean_life_exp, colour = continent)) +
  geom_line(linewidth = 2.5) +
  geom_point(size = 3.5) +
  theme_minimal(base_size = 18) +
  labs(x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time')
line_chart

Creating interactive Visualizations - ggiraph()

  • We want to make geom_point() and geom_line() interactive
  • This can be done by
    • changing geom_point() to geom_point_interactive()
    • changing geom_line() to geom_line_interactive()
    • adding the data_id aesthetic
line_chart <- mean_life_exps %>%
  ggplot(aes(x = year, y = mean_life_exp, colour = continent, data_id = id)) +
  geom_line_interactive(linewidth = 2.5) +
  geom_point_interactive(size = 3.5) +
  theme_minimal(base_size = 18) +
  labs(x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time')
line_chart

girafe(ggobj = line_chart)

Creating interactive Visualizations - ggiraph()

  • Currently, when you hover over a line, it turns orange
  • We want to set our interactivity in such a way that when we hover over a line, all other lines fade away
  • This can be done by passing hover options to our interactive chart
girafe(ggobj = line_chart,
  options = list(
    opts_hover(css = ''), ## CSS code of line we're hovering over
    opts_hover_inv(css = "opacity:0.1;"), ## CSS code of all other lines
    opts_sizing(rescale = FALSE)),
  height_svg = 6,
  width_svg = 9)
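
Besides data_id, the interactive geoms in ggiraph also understand a tooltip aesthetic, which shows text when the pointer is over an element. A minimal sketch on mtcars, kept separate from the gapminder example; the data frame name cars_df, its model column and the hover CSS are illustrative choices:

```r
library(ggplot2)
library(ggiraph)

# Keep the car names as a regular column so they can be mapped
cars_df <- mtcars
cars_df$model <- rownames(mtcars)

p_cars <- ggplot(cars_df, aes(x = wt, y = mpg,
                              tooltip = model,    # text shown on hover
                              data_id = model)) + # element that reacts to hover
  geom_point_interactive(size = 3) +
  theme_minimal()

widget <- girafe(ggobj = p_cars,
                 options = list(opts_hover(css = "fill:orange;")))
widget
```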

Creating interactive Visualizations - ggiraph()

  • Next, let us try to make an interactive scatterplot using gapminder_data
  • We want to plot life expectancy against population across continents for the year 2007
  • First, we make the interactive scatterplot
scatterplot_graph <- gapminder_data %>%
  filter(year == 2007) %>%
  ggplot(aes(x = life_exp, y = pop, fill = continent, data_id = id)) +
  geom_point_interactive(aes(col = continent), size = 3) +
  theme_minimal(base_size = 18) +
  labs(y = 'Population',
    title = 'Life expectancy vs Population in 2007')

scatterplot_graph

  • Next, to make all other continents' points fade away when we hover over one continent's points, we use hover options
girafe(ggobj = scatterplot_graph,
  options = list(opts_hover(css = ''),
    opts_hover_inv(css = "opacity:0.1;"),
    opts_sizing(rescale = FALSE)),
  height_svg = 6,
  width_svg = 9)

Combining ggiraph and patchwork Together

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • You can combine two interactive charts using patchwork to show the interaction of different variables at the same time
  • However, this can only be done when the underlying data_id is the same
girafe(
  ggobj = scatterplot_graph + plot_spacer() + line_chart + plot_layout(widths = c(0.45, 0.1, 0.45)),
  options = list(opts_hover(css = ''),
    opts_hover_inv(css = "opacity:0.1;"), 
    opts_sizing(rescale = FALSE)),
  height_svg = 8,
  width_svg = 12)

Do it Yourself - 6

  • Use penguins data from palmerpenguins package. Make an interactive scatterplot of bill_length_mm and body_mass_g, using species for the colour
  • Next, calculate the number of species for each year. Make an interactive line chart for the same
  • Combine both the interactive charts